
The Thera Bank - Credit Card Users Churn Prediction

General Overview¶

Background & Context¶

The Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving its credit card services would lead to losses, so the bank wants to analyze customer data, identify the customers who are likely to leave its credit card services, and understand the reasons why, so that it can improve in those areas.

We need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Objective¶

  • Explore and visualize the dataset.
  • Build a classification model to predict whether a customer is going to churn.
  • Optimize the model using appropriate techniques.
  • Generate a set of insights and recommendations that will help the bank.

Data Dictionary:¶

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
  • Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to the total transaction count in the 1st quarter
  • Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to the total transaction amount in the 1st quarter
  • Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
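A few of these columns appear to be arithmetically related. A minimal sanity check, using values copied from one sample row of the dataset; note the two identities below are assumptions inferred from the dictionary definitions, not documented guarantees:

```python
# Toy values copied from one sample row of the dataset
credit_limit = 12691.0
total_revolving_bal = 777
avg_open_to_buy = 11914.0
avg_utilization_ratio = 0.061

# Avg_Open_To_Buy appears to be the unused portion of the credit limit
assert credit_limit - total_revolving_bal == avg_open_to_buy

# Avg_Utilization_Ratio appears to be revolving balance / credit limit (to 3 dp)
assert round(total_revolving_bal / credit_limit, 3) == avg_utilization_ratio
print("Dictionary relationships hold for this row")
```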

Sanity Checks¶

Importing Necessary Libraries¶

In [1]:
#
# Loading Necessary Libraries
#

# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(
    color_codes=True
)  # -----This adds a background color to all the plots created using seaborn

# Allow the use of Display via interactive Python
from IPython.display import display

# Import tabulate, a library used for creating tables in a visually appealing format
from tabulate import tabulate

# Import library for exploratory visualization of missing data.
import missingno as ms

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for creating & personalizing pipelines
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import Pipeline as imb_Pipeline
from imblearn.pipeline import make_pipeline as make_imb_pipeline



from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import RandomOverSampler

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# Making the Python code more structured automatically
%load_ext nb_black

print("Loading Libraries... Done.")
Loading Libraries... Done.

Loading the Dataset¶

In [2]:
# Loading Dataset
data_path = "BankChurners.csv"
data = pd.read_csv(data_path)

# Making a copy of the data to avoid any changes to original data
df = data.copy()

print("Loading Dataset... Done.")
Loading Dataset... Done.

Data Overview¶

Checking a few rows of the Dataset¶

In [3]:
# Checking the top 5, bottom 5 and 10 random rows

display(df.head())  # -----looking at head (top 5 observations)
display(df.tail())  # -----looking at tail (bottom 5 observations)
display(
    df.sample(10, random_state=1)
)  # -----10 random sample of observations from the data
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 3 2 3 4003.000 1851 2152.000 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 4 2 3 4277.000 2186 2091.000 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 5 3 4 5409.000 0 5409.000 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 4 3 3 5281.000 0 5281.000 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 6 2 4 10388.000 1961 8427.000 0.703 10294 61 0.649 0.189
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
6498 712389108 Existing Customer 43 F 2 Graduate Married Less than $40K Blue 36 6 3 2 2570.000 2107 463.000 0.651 4058 83 0.766 0.820
9013 718388733 Existing Customer 38 F 1 College NaN Less than $40K Blue 32 2 3 3 2609.000 1259 1350.000 0.871 8677 96 0.627 0.483
2053 710109633 Existing Customer 39 M 2 College Married $60K - $80K Blue 31 6 3 2 9871.000 1061 8810.000 0.545 1683 34 0.478 0.107
3211 717331758 Existing Customer 44 M 4 Graduate Married $120K + Blue 32 6 3 4 34516.000 2517 31999.000 0.765 4228 83 0.596 0.073
5559 709460883 Attrited Customer 38 F 2 Doctorate Married Less than $40K Blue 28 5 2 4 1614.000 0 1614.000 0.609 2437 46 0.438 0.000
6106 789105183 Existing Customer 54 M 3 Post-Graduate Single $80K - $120K Silver 42 3 1 2 34516.000 2488 32028.000 0.552 4401 87 0.776 0.072
4150 771342183 Attrited Customer 53 F 3 Graduate Single $40K - $60K Blue 40 6 3 2 1625.000 0 1625.000 0.689 2314 43 0.433 0.000
2205 708174708 Existing Customer 38 M 4 Graduate Married $40K - $60K Blue 27 6 2 4 5535.000 1276 4259.000 0.636 1764 38 0.900 0.231
4145 718076733 Existing Customer 43 M 1 Graduate Single $60K - $80K Silver 31 4 3 3 25824.000 1170 24654.000 0.684 3101 73 0.780 0.045
5324 821889858 Attrited Customer 50 F 1 Doctorate Single abc Blue 46 6 4 3 1970.000 1477 493.000 0.662 2493 44 0.571 0.750

Observations

  • We will drop the CLIENTNUM column; it adds no value to our analysis and models.

  • Attrition_Flag is our target variable and will be converted to 0s and 1s.

  • Income_Category has an entry with the value 'abc' for record 5324. We need to investigate this further.
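Before fixing the 'abc' entries, it helps to quantify and locate them. A minimal sketch on a toy frame (`toy` is an illustrative stand-in for `df`, not the real data):

```python
import pandas as pd

# Toy frame standing in for df; "abc" marks corrupt Income_Category entries
toy = pd.DataFrame({"Income_Category": ["Less than $40K", "abc", "$40K - $60K", "abc"]})

# Flag the corrupt rows, then count and locate them
corrupt = toy["Income_Category"].eq("abc")
print(corrupt.sum())                    # → 2
print(toy.loc[corrupt].index.tolist())  # → [1, 3]
```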

Checking the shape of the dataset¶

In [4]:
# -----Print the dimension of the data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns")
There are 10127 rows and 21 columns

Checking the Data Types & General Information of the Dataset¶

In [5]:
# -----Displaying information about features of the Dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

Observations

  • There are missing values in the following columns: Education_Level & Marital_Status.
  • 5 Columns are of type Float, 10 are of type Integer and 6 are of type Object.

Getting the Statistical Summary for the Dataset¶

For Numerical Variables

In [6]:
# -----Displaying Statistical Summary of Numerical Data
df.describe().T
Out[6]:
count mean std min 25% 50% 75% max
CLIENTNUM 10127.000 739177606.334 36903783.450 708082083.000 713036770.500 717926358.000 773143533.000 828343083.000
Customer_Age 10127.000 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Dependent_count 10127.000 2.346 1.299 0.000 1.000 2.000 3.000 5.000
Months_on_book 10127.000 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.000 3.813 1.554 1.000 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.000 2.341 1.011 0.000 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.000 2.455 1.106 0.000 2.000 2.000 3.000 6.000
Credit_Limit 10127.000 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 0.275 0.276 0.000 0.023 0.176 0.503 0.999

Observations

  • The average age of a customer is about 46; the youngest and oldest are 26 and 73 years old respectively.
  • The average number of dependents, Dependent_count, is 2.35; the customers with the most dependents have 5.
  • The average period of relationship with the bank, Months_on_book, is about 36 months, with a minimum of 13 months and a maximum of 56 months.
  • The average total transaction amount, Total_Trans_Amt, is about 4,404, with minimum and maximum amounts of 510 and 18,484 respectively.
  • The average total transaction count, Total_Trans_Ct, is about 65, with minimum and maximum counts of 10 and 139 respectively.
  • It is noteworthy that the minimum value of Months_Inactive_12_mon (No. of months inactive in the last 12 months) is 0. This means at least some customers had no months of inactivity, i.e. they were active in every one of the last 12 months.
In [7]:
# -----Displaying the Summary of Categorical Data
df.describe(include=["object"]).T
Out[7]:
count unique top freq
Attrition_Flag 10127 2 Existing Customer 8500
Gender 10127 2 F 5358
Education_Level 8608 6 Graduate 3128
Marital_Status 9378 3 Married 4687
Income_Category 10127 6 Less than $40K 3561
Card_Category 10127 4 Blue 9436
In [8]:
# Get the Categorical Variables (Object types)
cat_cols = df.select_dtypes(["object"])

# Check the unique values of the categorical variables
for i in cat_cols.columns:
    print("Unique values % in", i, "are :")
    print(cat_cols[i].value_counts(normalize=True) * 100)
    print("*" * 50)
    print("\n")
Unique values % in Attrition_Flag are :
Existing Customer   83.934
Attrited Customer   16.066
Name: Attrition_Flag, dtype: float64
**************************************************


Unique values % in Gender are :
F   52.908
M   47.092
Name: Gender, dtype: float64
**************************************************


Unique values % in Education_Level are :
Graduate        36.338
High School     23.385
Uneducated      17.275
College         11.768
Post-Graduate    5.994
Doctorate        5.239
Name: Education_Level, dtype: float64
**************************************************


Unique values % in Marital_Status are :
Married    49.979
Single     42.045
Divorced    7.976
Name: Marital_Status, dtype: float64
**************************************************


Unique values % in Income_Category are :
Less than $40K   35.163
$40K - $60K      17.676
$80K - $120K     15.157
$60K - $80K      13.844
abc              10.981
$120K +           7.179
Name: Income_Category, dtype: float64
**************************************************


Unique values % in Card_Category are :
Blue       93.177
Silver      5.480
Gold        1.145
Platinum    0.197
Name: Card_Category, dtype: float64
**************************************************


Observations

  • The Churn Rate is 16.1%.
  • There are about 53% female and 47% male customers.
  • About 11% of Income_Category entries have the value "abc". This will be fixed.
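The churn rate above is simply the share of "Attrited Customer" in Attrition_Flag. A minimal sketch on a toy series (the 84/16 split is illustrative, not the real counts):

```python
import pandas as pd

# Toy target column with an illustrative 84/16 split
flag = pd.Series(["Existing Customer"] * 84 + ["Attrited Customer"] * 16)

# Churn rate = share of attrited customers, as a percentage
churn_rate = flag.eq("Attrited Customer").mean() * 100
print(f"Churn Rate = {churn_rate:.1f} %")  # → Churn Rate = 16.0 %
```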

Checking Missing & Duplicate Values¶

In [9]:
# Checking missing values across each columns

c_missing = pd.Series(df.isnull().sum(), name="Missing Count")  # -----Count Missing

p_missing = pd.Series(
    round(df.isnull().sum() / df.shape[0] * 100, 2), name="% Missing"
)  # -----Percentage Missing


# Combine into 1 Dataframe
missing_df = pd.concat([c_missing, p_missing], axis=1)

# # Display missing info
# display(missing_df)

missing_df.sort_values(by="% Missing", ascending=False).style.background_gradient(
    cmap="YlOrRd"
)
Out[9]:
  Missing Count % Missing
Education_Level 1519 15.000000
Marital_Status 749 7.400000
CLIENTNUM 0 0.000000
Contacts_Count_12_mon 0 0.000000
Total_Ct_Chng_Q4_Q1 0 0.000000
Total_Trans_Ct 0 0.000000
Total_Trans_Amt 0 0.000000
Total_Amt_Chng_Q4_Q1 0 0.000000
Avg_Open_To_Buy 0 0.000000
Total_Revolving_Bal 0 0.000000
Credit_Limit 0 0.000000
Total_Relationship_Count 0 0.000000
Months_Inactive_12_mon 0 0.000000
Attrition_Flag 0 0.000000
Months_on_book 0 0.000000
Card_Category 0 0.000000
Income_Category 0 0.000000
Dependent_count 0 0.000000
Gender 0 0.000000
Customer_Age 0 0.000000
Avg_Utilization_Ratio 0 0.000000
In [10]:
# Visual Exploration of Missing Values
# Plot missing values across each columns
plt.title("Missing Values Graph", fontsize=20)
ms.bar(df)
Out[10]:
<AxesSubplot:title={'center':'Missing Values Graph'}>

Observations

  • 15% of the values are missing from Education_Level.
  • Marital_Status has 7.4% missing values.
  • All the other features have no missing values.
In [11]:
# Checking for duplicate records

df.duplicated().sum()
Out[11]:
0

Observations

  • There are no duplicate values.

Pre-EDA Data Wrangling¶

To prevent data leakage, we are going to make a copy of the dataframe specifically for creating our models.

We will treat the corrupt Income_Category entries containing "abc" as NaN and then use SimpleImputer to replace the missing values with the mode. This is strictly for EDA purposes.

The model copy (model_df) will be used for model building and treated differently after splitting, to prevent data leakage.

In [12]:
model_df = df.copy()  # Make a copy of the dataframe to be used for building the models
eda_df = df.copy()  # Make a copy of the dataframe to be used for EDA.
In [13]:
# Replacing the corrupt data containing 'abc' with NAN
eda_df.replace("abc", np.nan, inplace=True)

# creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")

cols_for_impute = ["Education_Level", "Marital_Status", "Income_Category"]

# Fit & transform the imputer
eda_df[cols_for_impute] = imputer.fit_transform(eda_df[cols_for_impute])

# Create a numerical representation of the Target variable (0s & 1s) for EDA purposes
eda_df["Attrition_Flag_01"] = eda_df["Attrition_Flag"].apply(
    lambda x: 1 if x == "Attrited Customer" else 0
)
In [14]:
# Show that the EDA copy of the dataset has been treated while the Model copy is still intact.
print("From EDA Copy of Dataset:\n")
print(eda_df["Income_Category"].value_counts(normalize=True) * 100)
display(eda_df.isna().sum())

print("\nFrom Model Copy of Dataset:\n")
print(model_df["Income_Category"].value_counts(normalize=True) * 100)
display(model_df.isna().sum())
From EDA Copy of Dataset:

Less than $40K   46.144
$40K - $60K      17.676
$80K - $120K     15.157
$60K - $80K      13.844
$120K +           7.179
Name: Income_Category, dtype: float64
CLIENTNUM                   0
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
Attrition_Flag_01           0
dtype: int64
From Model Copy of Dataset:

Less than $40K   35.163
$40K - $60K      17.676
$80K - $120K     15.157
$60K - $80K      13.844
abc              10.981
$120K +           7.179
Name: Income_Category, dtype: float64
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64

Observations

  • The EDA copy of the dataset has been treated, while the Model copy remains untouched.
In [15]:
# CLIENTNUM consists of unique IDs for clients and hence will not add value to the modeling
eda_df.drop(["CLIENTNUM"], axis=1, inplace=True)

Exploratory Data Analysis¶

User Defined Functions

In [16]:
# -----
# User defined function to plot labeled_barplot
# -----


def labeled_barplot(data, feature, perc=False, v_ticks=True, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    if v_ticks is True:
        plt.xticks(rotation=90)

    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage
    plt.show()  # show the plot
In [17]:
# -----
# User defined function to print the 5 point summary and histogram, box plot,
#   and cumulative density distribution plots
# -----


def summary(data, x):
    """
    The function prints the 5 point summary and histogram, box plot,
    and cumulative density distribution plots for each
    feature name passed as the argument.

    Parameters:
    ----------

    data: dataframe
    x: str, feature name

    Usage:
    ------------

    summary(df, 'age')
    """

    x_min = data[x].min()
    x_max = data[x].max()
    Q1 = data[x].quantile(0.25)
    Q2 = data[x].quantile(0.50)
    Q3 = data[x].quantile(0.75)

    # avoid shadowing the built-in `dict`
    summary_dict = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}
    ldf = pd.DataFrame(data=summary_dict, index=["Value"])
    print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
    print(tabulate(ldf, headers="keys", tablefmt="psql"))

    fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(16, 22))
    sns.set_palette("Pastel1")

    # Histogram
    ax1 = sns.distplot(data[x], color="purple", ax=axs[0])
    ax1.axvline(np.mean(data[x]), color="purple", linestyle="--")
    ax1.axvline(np.median(data[x]), color="black", linestyle="-")
    ax1.set_title(f"{x.capitalize()} Density Distribution")

    # Boxplot
    ax2 = sns.boxplot(
        x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True, ax=axs[1]
    )
    ax2.set_title(f"{x.capitalize()} Boxplot")

    # Cumulative plot
    ax3 = sns.kdeplot(data[x], cumulative=True, linewidth=1.5, ax=axs[2])
    ax3.set_title(f"{x.capitalize()} Cumulative Density Distribution")

    plt.subplots_adjust(hspace=0.4)
    plt.show()
In [18]:
# -----
# User defined function to plot stacked bar chart
# -----


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 100)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    # single legend call (a second call would override the first)
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
In [19]:
# -----
# User defined function to plot both kde & boxplot of predictor variable wrt target
# -----


def kde_boxplot_wrt_target(data, predictor, target):
    """
    Plot a boxplot and a KDE plot of the predictor variable with respect to the target

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    # Create the Boxplot (use the `data` parameter, not a global dataframe)
    plt.figure(figsize=(15, 5))
    sns.boxplot(data=data, x=target, y=predictor, showmeans=True)
    plt.tight_layout()
    plt.show()

    # Create the KDE plot with hue
    sns.kdeplot(
        data=data,
        x=predictor,
        hue=target,
        fill=True,
    )
    # Add labels
    plt.xlabel(predictor)
    plt.ylabel("Density")
    plt.show()

Univariate analysis¶

Customer_Age¶

In [20]:
# -----Plot Histogram, Box Plot and Cumulative Plot
summary(eda_df, "Customer_Age")
5 Point Summary of Customer_age Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |    26 |   41 |   46 |   52 |    73 |
+-------+-------+------+------+------+-------+

Observations

  • The data looks normally distributed.
  • The oldest customer is 73 years old. This is a "valid outlier" because it is possible for someone to actually be that age; it is not an "erroneous outlier".
  • Over 90% of the customers are younger than 60 years.

Months_on_book¶

In [21]:
# -----Plot Histogram, Box Plot and Cumulative Plot
summary(eda_df, "Months_on_book")
5 Point Summary of Months_on_book Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |    13 |   31 |   36 |   40 |    56 |
+-------+-------+------+------+------+-------+

Observations

  • The distribution looks normal. The mean is almost equal to the median and the data looks well distributed on both sides.
  • Some customers have been associated with the bank for almost 5 years, while others for barely 1 year.
  • There are outliers. Again, these are "valid outliers" because they are possible values; they are not "erroneous outliers".
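The outliers flagged by the boxplots follow the usual 1.5 × IQR whisker rule. Plugging in the Months_on_book quartiles from the 5-point summary above (Q1 = 31, Q3 = 40):

```python
# 1.5 * IQR fences, using the Months_on_book quartiles from the 5-point summary
q1, q3 = 31, 40
iqr = q3 - q1                   # 9
lower_fence = q1 - 1.5 * iqr    # 17.5
upper_fence = q3 + 1.5 * iqr    # 53.5

# The observed Min (13) and Max (56) fall outside the fences,
# which is why the boxplot marks points on both sides as outliers
print(lower_fence, upper_fence)  # → 17.5 53.5
```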

Credit_Limit¶

In [22]:
# -----Plot Histogram, Box Plot and Cumulative Plot
summary(eda_df, "Credit_Limit")
5 Point Summary of Credit_limit Attribute:

+-------+--------+------+------+---------+-------+
|       |    Min |   Q1 |   Q2 |      Q3 |   Max |
|-------+--------+------+------+---------+-------|
| Value | 1438.3 | 2555 | 4549 | 11067.5 | 34516 |
+-------+--------+------+------+---------+-------+

Total_Revolving_Bal¶

In [23]:
# -----Plot Histogram, Box Plot and Cumulative Plot
summary(eda_df, "Total_Revolving_Bal")
5 Point Summary of Total_revolving_bal Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |  359 | 1276 | 1784 |  2517 |
+-------+-------+------+------+------+-------+

Observations

  • The mean is greater than the median, depicting right-skewed data.

Avg_Open_To_Buy¶

In [24]:
# -----Plot Histogram, Box Plot and Cumulative Plot
summary(eda_df, "Avg_Open_To_Buy")
5 Point Summary of Avg_open_to_buy Attribute:

+-------+-------+--------+------+------+-------+
|       |   Min |     Q1 |   Q2 |   Q3 |   Max |
|-------+-------+--------+------+------+-------|
| Value |     3 | 1324.5 | 3474 | 9859 | 34516 |
+-------+-------+--------+------+------+-------+

Observations

  • The mean is greater than the median, depicting right-skewed data.
  • From our domain knowledge we know that customers may or may not utilise the entire limit available on the card. Again, these are "valid outliers" because they are possible values; they are not "erroneous outliers".
  • The data looks normal for the domain and no treatment is required.

Total_Trans_Ct¶

In [25]:
# -----Plot Histogram, Box Plot and Cumulative Plot
summary(eda_df, "Total_Trans_Ct")
5 Point Summary of Total_trans_ct Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |    10 |   45 |   67 |   81 |   139 |
+-------+-------+------+------+------+-------+

Observations

  • The mean is slightly smaller than the median.
  • The data looks fairly balanced.
  • A few customers have used their cards much more than the others. Again, these are "valid outliers"; no treatment is required.

Total_Amt_Chng_Q4_Q1¶

In [26]:
# -----Plot Histogram, Box Plot and Cumulative Plot
summary(eda_df, "Total_Amt_Chng_Q4_Q1")
5 Point Summary of Total_amt_chng_q4_q1 Attribute:

+-------+-------+-------+-------+-------+-------+
|       |   Min |    Q1 |    Q2 |    Q3 |   Max |
|-------+-------+-------+-------+-------+-------|
| Value |     0 | 0.631 | 0.736 | 0.859 | 3.397 |
+-------+-------+-------+-------+-------+-------+

Observations

  • Mean is slightly greater than median.
  • There are outliers on both sides. Again, these are "valid outliers". The data looks normal for the domain and no treatment required.

Let's see how the total transaction amount is distributed.

Total_Trans_Amt¶

In [27]:
# -----Plot Histogram, Box Plot and Cumulative Plot
summary(eda_df, "Total_Trans_Amt")
5 Point Summary of Total_trans_amt Attribute:

+-------+-------+--------+------+------+-------+
|       |   Min |     Q1 |   Q2 |   Q3 |   Max |
|-------+-------+--------+------+------+-------|
| Value |   510 | 2155.5 | 3899 | 4741 | 18484 |
+-------+-------+--------+------+------+-------+

Observations

  • The mean is greater than the median.
  • The distribution is heavily right skewed.
  • Some customers are likely to spend more on their credit cards, which makes the data legitimate. Again, these are "valid outliers". The data looks normal for the domain and no treatment is required.
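The "mean greater than median" reading of right skew can be quantified with pandas' sample skewness. A minimal sketch on toy, right-tailed amounts (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy right-tailed transaction amounts; the long right tail pulls the mean up
amounts = pd.Series([500, 900, 1200, 1500, 2000, 2500, 9000, 18000])

print(amounts.mean() > amounts.median())  # → True (mean pulled above median)
print(amounts.skew() > 0)                 # → True (positive = right skew)
```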

Total_Ct_Chng_Q4_Q1¶

In [28]:
# -----Plot Histogram, Box Plot and Cumulative Plot
summary(eda_df, "Total_Ct_Chng_Q4_Q1")
5 Point Summary of Total_ct_chng_q4_q1 Attribute:

+-------+-------+-------+-------+-------+-------+
|       |   Min |    Q1 |    Q2 |    Q3 |   Max |
|-------+-------+-------+-------+-------+-------|
| Value |     0 | 0.582 | 0.702 | 0.818 | 3.714 |
+-------+-------+-------+-------+-------+-------+

Observations

  • This distribution looks normal; however, the boxplot reveals outliers on both sides.
  • Again, given the domain knowledge, these are "valid outliers". No treatment needed.

Avg_Utilization_Ratio¶

In [29]:
# -----Plot Histogram, Box plot and Cumulative Plot
summary(eda_df, "Avg_Utilization_Ratio")
5 Point Summary of Avg_utilization_ratio Attribute:

+-------+-------+-------+-------+-------+-------+
|       |   Min |    Q1 |    Q2 |    Q3 |   Max |
|-------+-------+-------+-------+-------+-------|
| Value |     0 | 0.023 | 0.176 | 0.503 | 0.999 |
+-------+-------+-------+-------+-------+-------+

Observations

  • Mean is greater than median.
  • The distribution is right skewed, but interestingly there are no outliers.

Dependent_count¶

In [30]:
# -----Call the label_barplot function to plot the graph
labeled_barplot(eda_df, "Dependent_count", True, False)

Observations

  • Most customers have 3 dependents followed by customers with 2 and then 1.

Total_Relationship_Count¶

In [31]:
# -----Call the label_barplot function to plot the graph
labeled_barplot(eda_df, "Total_Relationship_Count", True, False)

Observations

  • Around a quarter of customers have 3 relationships with the bank. Customers with 4, 5, and 6 relationships are almost equal in number.

Months_Inactive_12_mon¶

In [32]:
# -----Call the label_barplot function to plot the graph
labeled_barplot(eda_df, "Months_Inactive_12_mon", True, False)

Observations

  • About 45.3% of customers have not used their cards for 3 months or more in the last 12 months.
  • The data shows that only a small fraction of customers (0.3%) are active every month.
  • Around 1.2% of customers have not used their cards for 6 months in the last 12 months.

Contacts_Count_12_mon¶

In [33]:
# -----Call the label_barplot function to plot the graph
labeled_barplot(eda_df, "Contacts_Count_12_mon", True, False)

Observations

  • Most customers (96.1%) have interacted with the bank at least once in the last 12 months.
  • 33.4% of customers interacted with the bank 3 times in the last 12 months, followed by 31.9% who interacted 2 times.
  • A very small percentage (0.5%) of customers interacted 6 times in the last 12 months.

Gender¶

In [34]:
# -----Call the label_barplot function to plot the graph
labeled_barplot(eda_df, "Gender", True, False)

Observations

  • There are more Female customers than Male.

Education_Level¶

In [35]:
# -----Call the label_barplot function to plot the graph
labeled_barplot(eda_df, "Education_Level", True, True)

Observations

  • 65% of customers are College educated or higher (College, Graduate, Post-Graduate & Doctorate).
  • Most customers are Graduate followed by High School.

Marital_Status¶

In [36]:
# -----Call the label_barplot function to plot the graph
labeled_barplot(eda_df, "Marital_Status", True, False)

Observations

  • Around 53% of customers are married.
  • The next largest group is Single customers (38.9%).

Income_Category¶

In [37]:
# -----Call the label_barplot function to plot the graph
labeled_barplot(eda_df, "Income_Category", True, True)

Observations

  • Most customers earn less than $40K.
  • They are followed by customers earning between $40K and $60K.

Card_Category¶

In [38]:
# -----Call the label_barplot function to plot the graph
labeled_barplot(eda_df, "Card_Category", True, False)

Observations

  • Most customers have the "Blue" Card.
  • There are very few customers using "Gold" and "Platinum" Cards.

Bivariate analysis¶

Correlation Matrix¶

In [39]:
plt.figure(figsize=(15, 7))
sns.heatmap(eda_df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Observations

  • Avg_Open_To_Buy and Credit_Limit have a perfect positive linear relationship, which suggests multicollinearity. Most of the ML algorithms we are going to use are not affected by multicollinearity.

  • The perfect positive linear relationship between Avg_Open_To_Buy and Credit_Limit can mean:

    ◎ Customers are not using their cards.

    ◎ Customers pay off their credit cards quickly.


  • Avg_Open_To_Buy and Avg_Utilization_Ratio have a negative correlation, as expected.

  • Customer_Age and Months_on_book have a high correlation. This is to be expected.

  • Total_Trans_Amt is highly correlated with Total_Trans_Ct, because the total amount usually grows as the count of transactions grows.
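One plausible explanation for the perfect Avg_Open_To_Buy/Credit_Limit correlation (an assumption based on domain knowledge, not on this dataset's documentation) is the accounting identity "open to buy = credit limit − revolving balance": when balances are small relative to limits, the two columns move almost in lockstep. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical illustration: open_to_buy = credit_limit - revolving_bal.
# Balances are small relative to limits, so corr(open_to_buy, limit) is near 1.
df = pd.DataFrame(
    {
        "Credit_Limit": [2000, 5000, 10000, 20000, 30000],
        "Total_Revolving_Bal": [500, 1500, 0, 2000, 1000],
    }
)
df["Avg_Open_To_Buy"] = df["Credit_Limit"] - df["Total_Revolving_Bal"]

corr = df["Avg_Open_To_Buy"].corr(df["Credit_Limit"])
print(round(corr, 4))
```

With real account data the residual variation comes entirely from the revolving balance, which is exactly what the low-utilization observation above suggests.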
In [40]:
# We will draw a pair plot of interesting numerical features from the correlation matrix

features_for_pairplot = [
    "Customer_Age",
    "Months_on_book",
    "Credit_Limit",
    "Total_Revolving_Bal",
    "Avg_Open_To_Buy",
    "Total_Trans_Ct",
    "Avg_Utilization_Ratio",
    "Attrition_Flag_01",
]

# Create a pairplot
sns.pairplot(eda_df[features_for_pairplot], hue="Attrition_Flag_01")

# Display the plot
plt.show()

Observations

  • The Total_Trans_Ct for Attrited Customers is lower across the board, i.e. vs Customer_Age, Months_on_book, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, & Avg_Utilization_Ratio.
  • All other insights are in line with the observations from the Correlation Matrix above.

Attrition_Flag vs Gender¶

In [41]:
# Stacked Barplot of Attrition_Flag in comparison to Gender
stacked_barplot(eda_df, "Gender", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer    All
Gender                                                     
All                          1627               8500  10127
F                             930               4428   5358
M                             697               4072   4769
----------------------------------------------------------------------------------------------------

Observations

  • Slightly more female customers attrited than male customers.

Attrition_Flag vs Marital_Status¶

In [42]:
# Stacked Barplot of Attrition_Flag in comparison to Marital_Status
stacked_barplot(eda_df, "Marital_Status", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer    All
Marital_Status                                             
All                          1627               8500  10127
Married                       838               4598   5436
Single                        668               3275   3943
Divorced                      121                627    748
----------------------------------------------------------------------------------------------------

Observations

  • There is no significant difference in attrition based on marital status.

Attrition_Flag vs Education_Level¶

In [43]:
# Stacked Barplot of Attrition_Flag in comparison to Education_Level
stacked_barplot(eda_df, "Education_Level", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer    All
Education_Level                                             
All                           1627               8500  10127
Graduate                       743               3904   4647
High School                    306               1707   2013
Uneducated                     237               1250   1487
College                        154                859   1013
Doctorate                       95                356    451
Post-Graduate                   92                424    516
----------------------------------------------------------------------------------------------------

Observations

  • Proportionally, customers with Doctorate and Post-Graduate degrees attrited the most.

Attrition_Flag vs Income_Category¶

In [44]:
# Stacked Barplot of Attrition_Flag in comparison to Income_Category
stacked_barplot(eda_df, "Income_Category", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer    All
Income_Category                                             
All                           1627               8500  10127
Less than $40K                 799               3874   4673
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
$120K +                        126                601    727
----------------------------------------------------------------------------------------------------

Observations

  • Customers with the highest income level ($120K+) and the lowest income level (Less than $40K) attrited the most.

Attrition_Flag vs Total_Relationship_Count¶

In [45]:
# Stacked Barplot of Attrition_Flag in comparison to Total_Relationship_Count
stacked_barplot(eda_df, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag            Attrited Customer  Existing Customer    All
Total_Relationship_Count                                             
All                                    1627               8500  10127
3                                       400               1905   2305
2                                       346                897   1243
1                                       233                677    910
5                                       227               1664   1891
4                                       225               1687   1912
6                                       196               1670   1866
----------------------------------------------------------------------------------------------------

Observations

  • Customers with only 1 or 2 bank products attrited the most (they make up about 35% of attrited customers).
  • The more products a customer subscribes to, the less likely the customer is to attrite.

Total_Revolving_Bal vs Attrition_Flag¶

In [46]:
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Total_Revolving_Bal", "Attrition_Flag")

Observations

  • Attrited Customers generally have a lower Total_Revolving_Bal.

Attrition_Flag vs Credit_Limit¶

In [47]:
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Credit_Limit", "Attrition_Flag")

Observations

  • Credit_Limit does not appear to affect Attrition.

Attrition_Flag vs Customer_Age¶

In [48]:
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Customer_Age", "Attrition_Flag")

Observations

  • Customer_Age does not appear to affect Attrition.

Total_Trans_Ct vs Attrition_Flag¶

In [49]:
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Total_Trans_Ct", "Attrition_Flag")

Observations

  • Attrited Customers generally have a lower Total_Trans_Ct.

Total_Trans_Amt vs Attrition_Flag¶

In [50]:
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Total_Trans_Amt", "Attrition_Flag")

Observations

  • Attrited Customers generally have a lower Total_Trans_Amt.

Avg_Utilization_Ratio vs Attrition_Flag¶

In [51]:
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Avg_Utilization_Ratio", "Attrition_Flag")

Observations

  • Attrited Customers generally have a much lower Avg_Utilization_Ratio.

Attrition_Flag vs Months_on_book¶

In [52]:
# KDE & boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Months_on_book", "Attrition_Flag")

Observations

  • Months_on_book does not appear to affect Attrition.

Attrition_Flag vs Total_Revolving_Bal¶

In [53]:
# KDE & Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Total_Revolving_Bal", "Attrition_Flag")

Observations

  • Attrited Customers have a much lower Total_Revolving_Bal.

Attrition_Flag vs Avg_Open_To_Buy¶

In [54]:
# Boxplot with respect to Attrition_Flag
kde_boxplot_wrt_target(eda_df, "Avg_Open_To_Buy", "Attrition_Flag")

Observations

  • Avg_Open_To_Buy does not appear to affect Attrition.

Data Preprocessing¶

In [55]:
# Creating a list of numerical variables
num_features = [
    "Customer_Age",
    "Months_on_book",
    "Total_Relationship_Count",
    "Months_Inactive_12_mon",
    "Contacts_Count_12_mon",
    "Credit_Limit",
    "Total_Revolving_Bal",
    "Avg_Open_To_Buy",
    "Total_Amt_Chng_Q4_Q1",
    "Total_Trans_Amt",
    "Total_Trans_Ct",
    "Total_Ct_Chng_Q4_Q1",
    "Avg_Utilization_Ratio",
]

# Creating a list of categorical variables
cat_features = [
    "Gender",
    "Dependent_count",
    "Education_Level",
    "Marital_Status",
    "Income_Category",
    "Card_Category",
]
In [56]:
# CLIENTNUM consists of unique IDs for clients and hence will not add value to the modeling
model_df.drop(["CLIENTNUM"], axis=1, inplace=True)
In [57]:
# Replacing the corrupt data containing 'abc' with NaN
model_df.replace("abc", np.nan, inplace=True)

Outlier Detection & Treatment¶

As observed in the EDA, the outliers are values that are genuinely possible for the bank and its customers.

According to the business problem and domain knowledge, these outliers are possible values, so we will not drop them. They are "valid outliers"; the data looks normal for the domain and no treatment is required.

Instead, we will use log transformations to reduce the negative effect of outliers on our models.

Log Transformation¶

In [58]:
# -----
# Perform a log transformation of the numerical columns
# -----

# Creating a copy of the dataset so we always have a copy of the untreated dataset
model_untreated_df = model_df.copy()

# using log transforms
for col in num_features:
    model_df[col] = np.log(model_df[col] + 1)
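The effect of this transform on right-skewed values can be sketched with a few illustrative numbers (made-up values, not drawn from the dataset): `np.log1p`, i.e. log(x + 1), compresses the long right tail and pulls the skewness toward zero.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values resembling a transaction-amount column
raw = pd.Series([510.0, 1200.0, 2155.0, 3899.0, 4741.0, 18484.0])

raw_skew = raw.skew()
log_skew = np.log1p(raw).skew()  # log(x + 1), same as the loop above

print(raw_skew, log_skew)  # log_skew is much closer to 0
```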

MinMax Scaling¶

In [59]:
# -----
# Minmax scaling numeric features 
# -----

for col in num_features:
    model_df[col] = MinMaxScaler().fit_transform(model_df[[col]])
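Min-max scaling maps each feature to [0, 1] via (x − min) / (max − min). A tiny sketch with illustrative numbers only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[2.0], [4.0], [6.0], [10.0]])  # one feature, four rows

# Equivalent to (values - 2) / (10 - 2)
scaled = MinMaxScaler().fit_transform(values)
print(scaled.ravel())  # [0.   0.25 0.5  1.  ]
```

Note that in the cell above the scaler is fitted on the full dataset before the train/test split; fitting on the training set only and reusing that fitted scaler for validation and test would mirror the leakage-free approach used for imputation later on.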

Train, Val, Test Split¶

In [60]:
# Dividing dataset into X and y

X = model_df.drop(["Attrition_Flag"], axis=1)
y = model_df["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
In [61]:
# Splitting data into training, validation, and test sets:
# first we split the data into two parts, temporary and test,
# then we split the temporary set into train and validation

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)

Missing Value Treatment¶

In [62]:
# Show the number of missing values in the feature set.
display(X_train.isna().sum())
print("-" * 30)
display(X_val.isna().sum())
print("-" * 30)
display(X_test.isna().sum())
Customer_Age                  0
Gender                        0
Dependent_count               0
Education_Level             928
Marital_Status              457
Income_Category             654
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
dtype: int64
------------------------------
Customer_Age                  0
Gender                        0
Dependent_count               0
Education_Level             294
Marital_Status              140
Income_Category             221
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
dtype: int64
------------------------------
Customer_Age                  0
Gender                        0
Dependent_count               0
Education_Level             297
Marital_Status              152
Income_Category             237
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
dtype: int64

To prevent data leakage, the imputer is fitted on the training set only, and the transform is then applied to the training, validation, and test sets separately. That is, what is learned from the training set is used for imputation across the board (train, val & test), so the val & test sets remain 'unseen' by training.

In [63]:
# creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")

cols_for_impute = ["Education_Level", "Marital_Status", "Income_Category"]

# Fit and transform the train data
X_train[cols_for_impute] = imputer.fit_transform(X_train[cols_for_impute])

# Transform the validation data
X_val[cols_for_impute] = imputer.transform(X_val[cols_for_impute])

# Transform the test data
X_test[cols_for_impute] = imputer.transform(X_test[cols_for_impute])
In [64]:
# Verify that the missing values have been treated.
display(X_train.isna().sum())
print("-" * 30)
display(X_val.isna().sum())
print("-" * 30)
display(X_test.isna().sum())
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64

Observations

  • The missing values have been properly treated without any data leakage.
In [65]:
# Check the unique values of the categorical variables in train set
cols = X_train.select_dtypes(include=["object"])
for i in cols.columns:
    print(X_train[i].value_counts())
    print("~" * 35)
F    3193
M    2882
Name: Gender, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Graduate         2782
High School      1228
Uneducated        881
College           618
Post-Graduate     312
Doctorate         254
Name: Education_Level, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Married     3276
Single      2369
Divorced     430
Name: Marital_Status, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Less than $40K    2783
$40K - $60K       1059
$80K - $120K       953
$60K - $80K        831
$120K +            449
Name: Income_Category, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Blue        5655
Silver       339
Gold          69
Platinum      12
Name: Card_Category, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In [66]:
# Check the unique values of the categorical variables in validation set
cols = X_val.select_dtypes(include=["object"])
for i in cols.columns:
    print(X_val[i].value_counts())
    print("~" * 35)
F    1095
M     931
Name: Gender, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Graduate         917
High School      404
Uneducated       306
College          199
Post-Graduate    101
Doctorate         99
Name: Education_Level, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Married     1100
Single       770
Divorced     156
Name: Marital_Status, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Less than $40K    957
$40K - $60K       361
$80K - $120K      293
$60K - $80K       279
$120K +           136
Name: Income_Category, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Blue        1905
Silver        97
Gold          21
Platinum       3
Name: Card_Category, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In [67]:
# Check the unique values of the categorical variables in test set
cols = X_test.select_dtypes(include=["object"])
for i in cols.columns:
    print(X_test[i].value_counts())
    print("~" * 35)
F    1070
M     956
Name: Gender, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Graduate         948
High School      381
Uneducated       300
College          196
Post-Graduate    103
Doctorate         98
Name: Education_Level, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Married     1060
Single       804
Divorced     162
Name: Marital_Status, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Less than $40K    933
$40K - $60K       370
$60K - $80K       292
$80K - $120K      289
$120K +           142
Name: Income_Category, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Blue        1876
Silver       119
Gold          26
Platinum       5
Name: Card_Category, dtype: int64
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Encoding Categorical Variables¶

In [68]:
### Encoding categorical variables
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

print(X_train.shape, X_val.shape, X_test.shape)
(6075, 29) (2026, 29) (2026, 29)

Observations

  • After encoding, the number of columns increased from 19 to 29.
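One caveat worth noting: calling pd.get_dummies separately on train, validation, and test only yields identical column sets because every category happens to appear in all three splits (as the value counts above confirm). If a rare category were missing from one split, the dummy columns would misalign. A defensive sketch (with hypothetical data) realigns the other splits to the training columns:

```python
import pandas as pd

train = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Gold"]})
test = pd.DataFrame({"Card_Category": ["Blue", "Blue"]})  # "Silver"/"Gold" absent

train_d = pd.get_dummies(train, drop_first=True)
test_d = pd.get_dummies(test, drop_first=True)

# Reindex the test dummies to the training columns, filling missing ones with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))  # matches the training columns exactly
```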

Model Building¶

Model Evaluation Criteria¶

The model can make two kinds of wrong predictions:

  1. Predicting a customer will attrite when the customer doesn't attrite. These are False Positives (FP) - minimal loss of resources, if any.
  2. Predicting a customer will not attrite when the customer attrites. These are False Negatives (FN) - loss of revenue.

Which case is more important?

  • Predicting that a customer will not attrite when they actually attrite, i.e. losing a valuable customer and all the value that comes with them.

How do we reduce this loss, i.e. reduce False Negatives?

  • Recall (also called Sensitivity or the True Positive Rate (TPR)) is expressed as $Recall=\dfrac{TP}{TP+FN}$.
  • The bank wants Recall to be maximized: the fewer the False Negatives, the greater the Recall.
  • The greater the Recall, the higher the chance of correctly identifying the customers who will attrite, and the lower the chance of wrongly predicting that a customer will stay when they will not.
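The metric can be sanity-checked on a toy example (illustrative labels only): with three true attriters of which the model catches two, recall is 2/3 regardless of how many non-attriters are classified correctly.

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 0]  # three customers actually attrited
y_pred = [1, 1, 0, 0, 0, 1]  # model catches two of them (1 FN, 1 FP)

rec = recall_score(y_true, y_pred)  # TP / (TP + FN) = 2 / 3
print(rec)
```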

User Defined Functions¶

In [69]:
######
# User defined function to compute different metrics to check performance of a
# classification model built using sklearn
######


def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf
In [70]:
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Checking Class Imbalance¶

In [71]:
# -----Checking Class Balance in the Dataset
labeled_barplot(eda_df, "Attrition_Flag", True, True)

Observations

  • The target class Attrition_Flag is highly imbalanced.

Class Balancing Function¶

We have observed previously that the dependent variable Attrition_Flag has a high class imbalance. Hence we will try out four different class balancing strategies. We define a function below to modularize the code.
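The intuition behind the two simplest strategies can be sketched without imblearn: RandomOverSampler duplicates minority rows with replacement, while RandomUnderSampler discards majority rows (SMOTE and NearMiss are more involved, synthesizing or selecting points by nearest neighbors). A sketch with illustrative labels only:

```python
import pandas as pd

y = pd.Series([0] * 8 + [1] * 2)  # imbalanced toy labels: 8 majority, 2 minority

minority = y[y == 1]
majority = y[y == 0]

# Random oversampling: resample minority rows with replacement up to majority size
over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=1)])

# Random undersampling: subsample majority rows down to minority size
under = pd.concat([majority.sample(len(minority), random_state=1), minority])

print(over.value_counts().to_dict(), under.value_counts().to_dict())
```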

In [72]:
# We will use SMOTE and RandomOverSampler for OVERSAMPLING
# We will use RandomUnderSampler and NearMiss for UNDERSAMPLING


def Balance_Data(choice):

    # choice 0 returns the original (imbalanced) training data unchanged
    if choice == 0:
        return X_train, y_train

    # Pick the class balancing technique depending on the choice
    if choice == 1:
        sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)

    elif choice == 2:
        sm = RandomUnderSampler(random_state=1)

    elif choice == 3:
        sm = NearMiss(version=1)

    elif choice == 4:
        sm = RandomOverSampler(random_state=1)

    X_train_balanced, y_train_balanced = sm.fit_resample(X_train, y_train)
    
    print("Before Class Balancing, counts of label 'Attrited': {}".format(sum(y_train == 1)))
    print("Before Class Balancing, counts of label 'Not Attrited': {} \n".format(sum(y_train == 0)))
        
    print("After Class Balancing, counts of label 'Yes': {}".format(sum(y_train_balanced == 1)))
    print("After Class Balancing, counts of label 'No': {} \n".format(sum(y_train_balanced == 0)))
        
    print("After Class Balancing, the shape of train_X: {}".format(X_train_balanced.shape))
    print("After Class Balancing, the shape of train_y: {} \n".format(y_train_balanced.shape))
    
    return X_train_balanced, y_train_balanced

Model Creation Function¶

This function creates models with default settings based on the class balancing technique of CHOICE. The models are: Logistic Regression, Decision Tree, Bagging, Random Forest, GradientBoost & Xgboost.

In [73]:
def Model_Creation(CHOICE):

    """
    This function creates models with default settings based on the class balancing technique of CHOICE.
    The models are: Logistic Regression, Decision Tree, Bagging, Random Forest, GradientBoost & Xgboost

    CHOICE = CHOICE of Class Balancing
    """

    models = []  # Empty list to store all the models

    # --------------------------Appending models into the list--------------------------

    models.append(("Logistic Regression", LogisticRegression(random_state=1)))

    models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))

    models.append(("Bagging", BaggingClassifier(random_state=1)))

    models.append(("Random Forest", RandomForestClassifier(random_state=1)))

    models.append(("GBM", GradientBoostingClassifier(random_state=1)))

    models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))

    # -----------------------------------------------------------------------------------

    scores = []  # Empty list to store all model's Recall scores

    scores_val = []  # Empty list to store all model's Recall scores for Validation data

    names = []  # Empty list to store name of the models

    cv_results = []  # Empty list to store all model's CV scores

    # CHOICE = 0 returns X_train & y_train unchanged; 1 returns SMOTE-balanced data; etc.
    X_bal, y_bal = Balance_Data(CHOICE)

    # --------------------------loop through all models to get Recall score for train & val--------------------------

    print("\n" "Training Performance (RECALL):" "\n")

    for name, model in models:

        model.fit(X_bal, y_bal)
        score = recall_score(y_bal, model.predict(X_bal))
        print("{}: {}".format(name, score))

        scores.append(score)

        names.append(name)

    print("\n" "Validation Performance (RECALL):" "\n")

    for name, model in models:
        # the models were already fitted on the balanced data above; no refit needed
        score_val = recall_score(y_val, model.predict(X_val))
        print("{}: {}".format(name, score_val))

        scores_val.append(score_val)

    # -----------loop through all models to get cross validation results mean----------------------------------------

    print("\n" "Cross-Validation Performance (Mean):" "\n")
    for name, model in models:
        scoring = "recall"
        kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
        cv_result = cross_val_score(
            estimator=model, X=X_bal, y=y_bal, scoring=scoring, cv=kfold
        )
        cv_results.append(cv_result)

        print("{}: {}".format(name, cv_result.mean() * 100))

    return scores, scores_val, cv_results, names

Models Creation¶

...with Original (Imbalanced) Data¶

In [74]:
scores_IB, scores_val_IB, cv_results_IB, names = Model_Creation(0)
Training Performance (RECALL):

Logistic Regression: 0.4989754098360656
Decision Tree: 1.0
Bagging: 0.985655737704918
Random Forest: 1.0
GBM: 0.875
Xgboost: 1.0

Validation Performance (RECALL):

Logistic Regression: 0.5858895705521472
Decision Tree: 0.8159509202453987
Bagging: 0.8098159509202454
Random Forest: 0.803680981595092
GBM: 0.8558282208588958
Xgboost: 0.8834355828220859

Cross-Validation Performance (Mean):

Logistic Regression: 48.66509680795395
Decision Tree: 77.76556776556777
Bagging: 78.48194662480377
Random Forest: 75.71480900052329
GBM: 81.24646781789639
Xgboost: 86.57613814756672

...with Oversampled Data - SMOTE¶

In [75]:
scores_SMOTE, scores_val_SMOTE, cv_results_SMOTE, names = Model_Creation(1)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099 

After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,) 


Training Performance (RECALL):

Logistic Regression: 0.8633065306922926
Decision Tree: 1.0
Bagging: 0.9962737791723868
Random Forest: 1.0
GBM: 0.9760737399490096
Xgboost: 1.0

Validation Performance (RECALL):

Logistic Regression: 0.8159509202453987
Decision Tree: 0.843558282208589
Bagging: 0.8374233128834356
Random Forest: 0.8558282208588958
GBM: 0.8926380368098159
Xgboost: 0.8926380368098159

Cross-Validation Performance (Mean):

Logistic Regression: 85.97812157247591
Decision Tree: 93.68488906848313
Bagging: 94.86185995497316
Random Forest: 97.17598953222112
GBM: 96.92097211799341
Xgboost: 98.05849641132212

...with Undersampled Data - Random Under Sampling¶

In [76]:
scores_RUS, scores_val_RUS, cv_results_RUS, names = Model_Creation(2)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 976
After Class Balancing, counts of label 'No': 976 

After Class Balancing, the shape of train_X: (1952, 29)
After Class Balancing, the shape of train_y: (1952,) 


Training Performance (RECALL):

Logistic Regression: 0.8258196721311475
Decision Tree: 1.0
Bagging: 0.9907786885245902
Random Forest: 1.0
GBM: 0.9805327868852459
Xgboost: 1.0

Validation Performance (RECALL):

Logistic Regression: 0.8466257668711656
Decision Tree: 0.9202453987730062
Bagging: 0.9325153374233128
Random Forest: 0.9386503067484663
GBM: 0.9570552147239264
Xgboost: 0.9570552147239264

Cross-Validation Performance (Mean):

Logistic Regression: 81.6624803767661
Decision Tree: 88.52433281004708
Bagging: 90.7796964939822
Random Forest: 93.34013605442178
GBM: 94.26216640502355
Xgboost: 95.28833071690215

...with Undersampled Data - NearMiss¶

In [77]:
scores_NM, scores_val_NM, cv_results_NM, names = Model_Creation(3)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 976
After Class Balancing, counts of label 'No': 976 

After Class Balancing, the shape of train_X: (1952, 29)
After Class Balancing, the shape of train_y: (1952,) 


Training Performance (RECALL):

Logistic Regression: 0.8391393442622951
Decision Tree: 1.0
Bagging: 0.992827868852459
Random Forest: 1.0
GBM: 0.9877049180327869
Xgboost: 1.0

Validation Performance (RECALL):

Logistic Regression: 0.8926380368098159
Decision Tree: 0.8957055214723927
Bagging: 0.911042944785276
Random Forest: 0.9601226993865031
GBM: 0.9631901840490797
Xgboost: 0.9693251533742331

Cross-Validation Performance (Mean):

Logistic Regression: 82.27472527472527
Decision Tree: 88.11355311355312
Bagging: 90.98430141287285
Random Forest: 95.49031920460493
GBM: 94.26111983254842
Xgboost: 95.18367346938777

...with Oversampled Data - Random Over Sampling¶

In [78]:
scores_ROS, scores_val_ROS, cv_results_ROS, names = Model_Creation(4)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099 

After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,) 


Training Performance (RECALL):

Logistic Regression: 0.8458521278682094
Decision Tree: 1.0
Bagging: 1.0
Random Forest: 1.0
GBM: 0.9811727789762699
Xgboost: 1.0

Validation Performance (RECALL):

Logistic Regression: 0.8404907975460123
Decision Tree: 0.7699386503067485
Bagging: 0.8128834355828221
Random Forest: 0.843558282208589
GBM: 0.9386503067484663
Xgboost: 0.9355828220858896

Cross-Validation Performance (Mean):

Logistic Regression: 84.17337258750409
Decision Tree: 99.80392156862746
Bagging: 99.70588235294117
Random Forest: 99.78429448324964
GBM: 97.74455925647982
Xgboost: 99.9607843137255
In [79]:
pd.DataFrame(cv_results_ROS).mean(axis=1)
Out[79]:
0   0.842
1   0.998
2   0.997
3   0.998
4   0.977
5   1.000
dtype: float64
In [80]:
# Training performance comparison

models_train_comp_df = pd.concat(
    [
        pd.DataFrame(scores_IB),
        pd.DataFrame(scores_SMOTE),
        pd.DataFrame(scores_RUS),
        pd.DataFrame(scores_NM),
        pd.DataFrame(scores_ROS),
    ],
    axis=1,
)
models_train_comp_df.index = [names]
models_train_comp_df.columns = ["Original", "SMOTE", "RUS", "NearMiss", "ROS"]
In [81]:
# Print out the Training performance comparison matrix

print("Training performance comparison:")
models_train_comp_df.T
Training performance comparison:
Out[81]:
Logistic Regression Decision Tree Bagging Random Forest GBM Xgboost
Original 0.499 1.000 0.986 1.000 0.875 1.000
SMOTE 0.863 1.000 0.996 1.000 0.976 1.000
RUS 0.826 1.000 0.991 1.000 0.981 1.000
NearMiss 0.839 1.000 0.993 1.000 0.988 1.000
ROS 0.846 1.000 1.000 1.000 0.981 1.000
In [82]:
# Cross validation performance comparison (Mean)

models_cv_comp_df = pd.concat(
    [
        pd.DataFrame(cv_results_IB).mean(
            axis=1
        ),  # Mean of Cross Validation results across Rows
        pd.DataFrame(cv_results_SMOTE).mean(axis=1),
        pd.DataFrame(cv_results_RUS).mean(axis=1),
        pd.DataFrame(cv_results_NM).mean(axis=1),
        pd.DataFrame(cv_results_ROS).mean(axis=1),
    ],
    axis=1,
)
models_cv_comp_df.index = [names]
models_cv_comp_df.columns = ["Original", "SMOTE", "RUS", "NearMiss", "ROS"]
In [83]:
# Print out the Cross Validation performance comparison matrix

print("Cross Validation performance (Mean) comparison:")
models_cv_comp_df.T
Cross Validation performance (Mean) comparison:
Out[83]:
Logistic Regression Decision Tree Bagging Random Forest GBM Xgboost
Original 0.487 0.778 0.785 0.757 0.812 0.866
SMOTE 0.860 0.937 0.949 0.972 0.969 0.981
RUS 0.817 0.885 0.908 0.933 0.943 0.953
NearMiss 0.823 0.881 0.910 0.955 0.943 0.952
ROS 0.842 0.998 0.997 0.998 0.977 1.000

Observations

  • We have successfully built 30 models: each of the 6 algorithms was fitted on the original (imbalanced) training data and on the 4 corresponding oversampled/undersampled versions, evaluated on the validation set, and cross-validated on the training data.

  • It is clear from the tables above that the Recall scores on the Balanced (Oversampled & Undersampled) data are better than on the Original (Imbalanced) data for all 6 algorithms.

Models Selection¶

The Approach¶

  1. We will plot the BoxPlot of CV scores of all models defined above for the Original (Imbalanced) Data

    ◎ From our observation above, the Recall scores on the Balanced (Oversampled & Undersampled) data are better than on the Original (Imbalanced) data for all 6 algorithms.

    ◎ So we use the Original (Imbalanced) models as the baseline, expecting the models trained on the corresponding Balanced (Oversampled or Undersampled) data to score better.


  2. We will pick the best 4 algorithms

    ◎ This will help us narrow down from 6 algorithms (30 models) to the best 4 algorithms (16 models - excluding the models of the Original (Imbalanced) data).


  3. We will use the following factors to sift the 16 models: Bias, Variance, and Standard Deviation.

    ◎ This will reduce the models of interest to 6 models (out of 16 models)


  4. We will calculate the Confidence Interval of 95% for each of our 6 models, based on the Cross Validation scores.

  5. We will now compare the Recall scores from the Validation set with these Confidence Intervals.

  6. We will then tune the models using RandomizedSearchCV...while checking the results against the validation data

  7. We are now ready to choose our BEST model and run it against the Test data
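Step 4's percentile-based interval can be sketched in isolation. As an illustration, using the five Bagging_ROS fold scores that appear in the CV-score table further down, a 95% interval spans the 2.5th to 97.5th percentiles of the fold scores:

```python
import numpy as np

# Five CV-fold recall scores for one model (the Bagging_ROS folds shown later)
cv_scores = np.array([0.991, 0.999, 0.998, 1.000, 0.997])

# A 95% confidence interval spans the 2.5th to 97.5th percentiles
lower, upper = np.percentile(cv_scores, [2.5, 97.5])
print(round(lower, 3), round(upper, 3))  # 0.992 1.0
```

This matches the Bagging_ROS row of the confidence-interval table computed later with `DataFrame.quantile([0.025, 0.975])`, which is the same percentile calculation.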

The Selection¶

In [84]:
# Plotting boxplots for CV scores of all models defined above for the Original (Imbalanced) Data
fig = plt.figure()

fig.suptitle("Algorithm Comparison: Original (Imbalance) Data")
ax = fig.add_subplot(111)

plt.boxplot(cv_results_IB)
ax.set_xticklabels(names, rotation="vertical")

plt.show()

Observations

  • XGBoost is clearly the best model with lowest Bias (highest Mean Recall score) and lowest Variance (compact IQR and lowest Standard Deviation).

  • GBM is the second best model with relatively low Bias (high Mean Recall score) and low Variance.

  • Bagging and Random Forest have similar Bias even though Random Forest has a lower Variance.

  • Logistic Regression has a high Bias (lowest Mean Recall score) and high Variance (high standard deviation).

  • Decision Tree has the highest Variance (highest standard deviation).

The 4 Best Algorithms are XGBoost, GradientBoost, Bagging and Random Forest.¶

Each has 4 models based on our 4 data-balancing strategies (SMOTE, RandomUnderSampler, NearMiss & RandomOverSampler)...making 16 models in total.

We will carry out further analysis to arrive at 6 models.

In [85]:
#---------------------
# Print the Cross-Validation statistics (mean, max and STD) of the 16 models.
# These statistics tell the story of Bias & Variance and we will be able to sift the models down to 6.
#---------------------

# The 4 best algorithms and their index in the cv_results_* collections
algorithms = [("Bagging", 2), ("Random Forest", 3), ("GradientBoost", 4), ("XGBoost", 5)]

# The 4 data-balancing strategies and their cross-validation results
samplers = [
    ("SMOTE", cv_results_SMOTE),
    ("RUS", cv_results_RUS),
    ("NearMiss", cv_results_NM),
    ("ROS", cv_results_ROS),
]

for a, (algo_name, idx) in enumerate(algorithms):
    if a:  # blank line between algorithm blocks
        print('\n')
    print(f"------------------------{algo_name} Cross Validation Statistics--------------------------")
    for s, (sampler_name, cv_results) in enumerate(samplers):
        if s:  # separator between sampler blocks
            print('\n')
            print("~" * 35)
        print('For', sampler_name)
        print('Mean CV Score:', cv_results[idx].mean())
        print('Max CV Score:', cv_results[idx].max())
        print('CV STD:', cv_results[idx].std())
------------------------Bagging Cross Validation Statistics--------------------------
For SMOTE
Mean CV Score: 0.9486185995497316
Max CV Score: 0.9548577036310107
CV STD: 0.00421964722419514


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For RUS
Mean CV Score: 0.907796964939822
Max CV Score: 0.9384615384615385
CV STD: 0.0167904933180432


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For NearMiss
Mean CV Score: 0.9098430141287285
Max CV Score: 0.9333333333333333
CV STD: 0.012336736910721664


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For ROS
Mean CV Score: 0.9970588235294118
Max CV Score: 1.0
CV STD: 0.003100272215851341


------------------------Random Forest Cross Validation Statistics--------------------------
For SMOTE
Mean CV Score: 0.9717598953222113
Max CV Score: 0.9754901960784313
CV STD: 0.004652764434332977


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For RUS
Mean CV Score: 0.9334013605442177
Max CV Score: 0.9641025641025641
CV STD: 0.0197290681683787


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For NearMiss
Mean CV Score: 0.9549031920460493
Max CV Score: 0.9693877551020408
CV STD: 0.010992120364461887


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For ROS
Mean CV Score: 0.9978429448324964
Max CV Score: 0.9990196078431373
CV STD: 0.0011431259475611552


------------------------GradientBoost Cross Validation Statistics--------------------------
For SMOTE
Mean CV Score: 0.9692097211799341
Max CV Score: 0.9715686274509804
CV STD: 0.0020176271846683437


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For RUS
Mean CV Score: 0.9426216640502355
Max CV Score: 0.9538461538461539
CV STD: 0.010465803244080312


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For NearMiss
Mean CV Score: 0.9426111983254841
Max CV Score: 0.9540816326530612
CV STD: 0.008883415425976452


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For ROS
Mean CV Score: 0.9774455925647981
Max CV Score: 0.9803921568627451
CV STD: 0.002712530311165646


------------------------XGBoost Cross Validation Statistics--------------------------
For SMOTE
Mean CV Score: 0.9805849641132213
Max CV Score: 0.9833333333333333
CV STD: 0.00368676461417039


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For RUS
Mean CV Score: 0.9528833071690215
Max CV Score: 0.9692307692307692
CV STD: 0.010373724357350156


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For NearMiss
Mean CV Score: 0.9518367346938776
Max CV Score: 0.9591836734693877
CV STD: 0.006999809706174566


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For ROS
Mean CV Score: 0.9996078431372549
Max CV Score: 1.0
CV STD: 0.0007843137254901933

Observations

  • Considering the Mean score, Max score and Standard Deviation, 6 models are of interest: Bagging with RandomOversampling, Random Forest with RandomOversampling, GradientBoost with RandomOversampling, XGBoost with RandomOversampling, Random Forest with SMOTE and XGBoost with SMOTE.

  • These 6 models have the best Bias vs Variance combination i.e. lowest Bias (highest Mean Recall scores) and lowest Variance (lowest Standard Deviations).

Aggregating the CV scores for the 6 models we are interested in...

In [86]:
# Models that Qualified so far
q_models = [
    "Bagging_ROS",
    "RandomForest_ROS",
    "GradientBoost_ROS",
    "XGBoost_ROS",
    "RandomForest_SMOTE",
    "XGBoost_SMOTE",
]

# Cross Validation Score (CVS) labels
cvs_labels = ["cvs_1", "cvs_2", "cvs_3", "cvs_4", "cvs_5"]

# Define an empty list
cv_scores = []

# Append the cross validation scores
cv_scores.append(cv_results_ROS[2])
cv_scores.append(cv_results_ROS[3])
cv_scores.append(cv_results_ROS[4])
cv_scores.append(cv_results_ROS[5])
cv_scores.append(cv_results_SMOTE[3])
cv_scores.append(cv_results_SMOTE[5])

# Convert to Dataframe
cv_scores_df = pd.DataFrame(cv_scores, index=q_models, columns=cvs_labels)

cv_scores_df = cv_scores_df.T

cv_scores_df
Out[86]:
Bagging_ROS RandomForest_ROS GradientBoost_ROS XGBoost_ROS RandomForest_SMOTE XGBoost_SMOTE
cvs_1 0.991 0.996 0.980 1.000 0.968 0.974
cvs_2 0.999 0.998 0.977 1.000 0.975 0.983
cvs_3 0.998 0.997 0.979 0.998 0.975 0.982
cvs_4 1.000 0.999 0.973 1.000 0.975 0.983
cvs_5 0.997 0.999 0.977 1.000 0.965 0.980
In [87]:
# Create the boxplot
plt.boxplot(cv_scores_df)

# Reuse the qualified model names as x-tick labels
plt.xticks([1, 2, 3, 4, 5, 6], q_models, rotation="vertical")

plt.show()

Observations

  • While cross-validation scores are not the final assessment of model performance, Bagging_ROS, RandomForest_ROS & XGBoost_ROS are showing a lot of promise.

We will calculate the Confidence Intervals and tune the models' hyperparameters.¶

Then compare each tuned model's Recall score on the validation set with its Confidence Interval.¶

Calculating the Confidence Interval for the models we are interested in...

In [88]:
# Calculate the confidence interval for each model's scores

# A confidence interval of 95% lies between 2.5% and 97.5%
lower_percentile = 2.5  # Lower percentile for the confidence interval

upper_percentile = 97.5  # Upper percentile for the confidence interval

confidence_intervals_df = cv_scores_df.quantile(
    [lower_percentile / 100, upper_percentile / 100]
)

# Transpose the DataFrame for a better display
confidence_intervals_df = confidence_intervals_df.T

confidence_intervals_df.columns = ["Lower CI", "Upper CI"]

confidence_intervals_df
Out[88]:
Lower CI Upper CI
Bagging_ROS 0.992 1.000
RandomForest_ROS 0.996 0.999
GradientBoost_ROS 0.973 0.980
XGBoost_ROS 0.998 1.000
RandomForest_SMOTE 0.965 0.975
XGBoost_SMOTE 0.974 0.983

Getting the Recall scores from running the models on the Validation set...

Then we compare the scores of the 6 models of interest to see whether they lie within the confidence intervals.

In [89]:
# Validation performance comparison

models_val_comp_df = pd.concat(
    [
        pd.DataFrame(scores_val_IB),
        pd.DataFrame(scores_val_SMOTE),
        pd.DataFrame(scores_val_RUS),
        pd.DataFrame(scores_val_NM),
        pd.DataFrame(scores_val_ROS),
    ],
    axis=1,
)
models_val_comp_df.index = [names]
models_val_comp_df.columns = ["Original", "SMOTE", "RUS", "NearMiss", "ROS"]
In [90]:
# Print out the Validation performance comparison matrix

print("Validation Performance Comparison:")
models_val_comp_df.T
Validation Performance Comparison:
Out[90]:
Logistic Regression Decision Tree Bagging Random Forest GBM Xgboost
Original 0.586 0.816 0.810 0.804 0.856 0.883
SMOTE 0.816 0.844 0.837 0.856 0.893 0.893
RUS 0.847 0.920 0.933 0.939 0.957 0.957
NearMiss 0.893 0.896 0.911 0.960 0.963 0.969
ROS 0.840 0.770 0.813 0.844 0.939 0.936
In [91]:
#######
# We will add the validation performance scores beside the Confidence Intervals
# so that it will be easy for us to compare the results.
#######

val_perf = []  # create an empty list

# Append the validation performance scores
val_perf.append(scores_val_ROS[2])
val_perf.append(scores_val_ROS[3])
val_perf.append(scores_val_ROS[4])
val_perf.append(scores_val_ROS[5])
val_perf.append(scores_val_SMOTE[3])
val_perf.append(scores_val_SMOTE[5])

# Add the list to the CI dataframe
confidence_intervals_df["Val_Perf"] = val_perf

confidence_intervals_df
Out[91]:
Lower CI Upper CI Val_Perf
Bagging_ROS 0.992 1.000 0.813
RandomForest_ROS 0.996 0.999 0.844
GradientBoost_ROS 0.973 0.980 0.939
XGBoost_ROS 0.998 1.000 0.936
RandomForest_SMOTE 0.965 0.975 0.856
XGBoost_SMOTE 0.974 0.983 0.893

Observations

  • None of the validation performance scores fall within the 95% Confidence Intervals; they all lie below the lower bounds.

  • We will perform hyperparameter tuning to try to improve the performance of the models.
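This comparison can also be made explicit programmatically. A small sketch that rebuilds the table from the values shown above and adds a boolean column (in the notebook, `confidence_intervals_df` already holds these three columns):

```python
import pandas as pd

# Values copied from the confidence-interval table above (Out[91])
ci = pd.DataFrame(
    {
        "Lower CI": [0.992, 0.996, 0.973, 0.998, 0.965, 0.974],
        "Upper CI": [1.000, 0.999, 0.980, 1.000, 0.975, 0.983],
        "Val_Perf": [0.813, 0.844, 0.939, 0.936, 0.856, 0.893],
    },
    index=["Bagging_ROS", "RandomForest_ROS", "GradientBoost_ROS",
           "XGBoost_ROS", "RandomForest_SMOTE", "XGBoost_SMOTE"],
)

# True only when the validation recall lies inside the 95% interval
ci["Within_CI"] = ci["Val_Perf"].between(ci["Lower CI"], ci["Upper CI"])
print(ci["Within_CI"].any())  # False: every score falls below its lower bound
```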

Hyperparameter Tuning¶

Tuning Bagging with ROS¶

In [92]:
######
## Tune Bagging with ROS using RandomizedSearchCV
######

# Define the base estimator
base_estimator = DecisionTreeClassifier(random_state=1)

# Define the Bagging classifier
bgc = BaggingClassifier(base_estimator=base_estimator, random_state=1)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': np.arange(10, 100, 10),
    'max_samples': np.arange(0.1, 1.1, 0.1),
    'max_features': np.arange(0.1, 1.1, 0.1),
    'bootstrap': [True, False],
    'base_estimator': [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
        DecisionTreeClassifier(max_depth=4, random_state=1),
        DecisionTreeClassifier(max_depth=5, random_state=1),
    ],

}

# Perform random search cross-validation
bag_ROS_tuned = RandomizedSearchCV(bgc, param_distributions=param_grid, n_iter=50, cv=5, scoring='recall', random_state=1, n_jobs = -1)

X_train_ROS, y_train_ROS = Balance_Data(4) #Generate Oversampled data using ROS - option 4

bag_ROS_tuned.fit(X_train_ROS, y_train_ROS)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099 

After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,) 

Out[92]:
RandomizedSearchCV(cv=5,
                   estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
                                               random_state=1),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'base_estimator': [DecisionTreeClassifier(max_depth=1,
                                                                                  random_state=1),
                                                           DecisionTreeClassifier(max_depth=2,
                                                                                  random_state=1),
                                                           DecisionTreeClassifier(max_depth=3,
                                                                                  random_state=1),
                                                           DecisionTreeClassifier(max_depth=4,
                                                                                  random_state=1),
                                                           DecisionTreeClassifier(max_depth=5,
                                                                                  random_state=1)],
                                        'bootstrap': [True, False],
                                        'max_features': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
                                        'max_samples': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
                   random_state=1, scoring='recall')
In [93]:
# checking the model performance on Validation set
bag_ROS_model_val_perf = model_performance_classification_sklearn(
    bag_ROS_tuned, X_val, y_val
)
print("bag_ROS_tuned Validation performance:")
bag_ROS_model_val_perf
bag_ROS_tuned Validation performance:
Out[93]:
Accuracy Recall Precision F1
0 0.928 0.945 0.708 0.809

Tuning RandomForest with ROS¶

In [94]:
######
## Tune RandomForest with ROS using RandomizedSearchCV
######

# Define the Random Forest classifier
rf = RandomForestClassifier(random_state=1)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': np.arange(100, 1000, 100),
    'max_depth': np.arange(2, 20),
    'min_samples_leaf': np.arange(1, 10),
    'max_features': np.arange(0.2, 0.8, 0.1),
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
    'class_weight': ['balanced', 'balanced_subsample'],
    'min_impurity_decrease':[0.001, 0.002, 0.003]
}
        
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Perform random search cross-validation
rf_ROS_tuned = RandomizedSearchCV(rf, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

X_train_ROS, y_train_ROS = Balance_Data(4) #Generate Oversampled data using ROS - option 4

rf_ROS_tuned.fit(X_train_ROS, y_train_ROS)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099 

After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,) 

Out[94]:
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=1),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'class_weight': ['balanced',
                                                         'balanced_subsample'],
                                        'criterion': ['gini', 'entropy'],
                                        'max_depth': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19]),
                                        'max_features': array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]),
                                        'min_impurity_decrease': [0.001, 0.002,
                                                                  0.003],
                                        'min_samples_leaf': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
                                        'n_estimators': array([100, 200, 300, 400, 500, 600, 700, 800, 900])},
                   random_state=1, scoring=make_scorer(recall_score))
In [95]:
# checking the model performance for Validation set
rf_ROS_model_val_perf = model_performance_classification_sklearn(
    rf_ROS_tuned, X_val, y_val
)
print("rf_SMOTE Validation performance:")
rf_ROS_model_val_perf
rf_SMOTE Validation performance:
Out[95]:
Accuracy Recall Precision F1
0 0.950 0.911 0.803 0.853

Tuning GradientBoost with ROS¶

In [96]:
######
## Tune GradientBoost with ROS using RandomizedSearchCV
######

# Define the Gradient Boosting classifier (no random_state set here, so exact results may vary between runs)
gb = GradientBoostingClassifier()

# Define the hyperparameter grid
param_grid = {
    'n_estimators': np.arange(100, 1000, 100),
    'learning_rate': [0.1, 0.01, 0.001],
    'max_depth': np.arange(2, 10),
    'min_samples_split': np.arange(2, 10),
    'min_samples_leaf': np.arange(1, 10),
    'max_features': ['sqrt', 'log2'],
}    
        
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Perform random search cross-validation
gb_ROS_tuned = RandomizedSearchCV(gb, param_distributions=param_grid, n_iter=50, cv=5, scoring=scorer, random_state=1, n_jobs = -1)

#Generate Oversampled data using RandomOversampler - option 4
X_train_ROS, y_train_ROS = Balance_Data(4) 

#Fitting the model oversampled (ROS) train set
gb_ROS_tuned.fit(X_train_ROS, y_train_ROS)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099 

After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,) 

Out[96]:
RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(), n_iter=50,
                   n_jobs=-1,
                   param_distributions={'learning_rate': [0.1, 0.01, 0.001],
                                        'max_depth': array([2, 3, 4, 5, 6, 7, 8, 9]),
                                        'max_features': ['sqrt', 'log2'],
                                        'min_samples_leaf': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
                                        'min_samples_split': array([2, 3, 4, 5, 6, 7, 8, 9]),
                                        'n_estimators': array([100, 200, 300, 400, 500, 600, 700, 800, 900])},
                   random_state=1, scoring=make_scorer(recall_score))
In [97]:
# checking the model performance for Validation set
gb_ROS_model_val_perf = model_performance_classification_sklearn(
    gb_ROS_tuned, X_val, y_val
)
print("gb_ROS_tuned Validation performance:")
gb_ROS_model_val_perf
gb_ROS_tuned Validation performance:
Out[97]:
Accuracy Recall Precision F1
0 0.973 0.899 0.933 0.916

Tuning XGBoost with ROS¶

In [98]:
######
## Tune XGBoost with ROS using RandomizedSearchCV
######

# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')

# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,200,50),
            'scale_pos_weight':[1,2,5,10,12,15],
            'learning_rate':[0.01,0.1,0.2,0.05],
            'gamma':[0,1,3,5],
            'subsample':[0.7,0.8,0.9,1],
            'max_depth':np.arange(1,7,1),
            'reg_lambda':[5,10,12,15]}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
XGBoost_ROS_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Generate Oversampled data using RandomOversampler - option 4
X_train_ROS, y_train_ROS = Balance_Data(4) 

#Fitting parameters in RandomizedSearchCV
XGBoost_ROS_tuned.fit(X_train_ROS, y_train_ROS)

# Access the best estimator from the RandomizedSearchCV
XGBoost_ROS_tuned.best_params_
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099 

After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,) 

Out[98]:
{'subsample': 1,
 'scale_pos_weight': 12,
 'reg_lambda': 10,
 'n_estimators': 150,
 'max_depth': 1,
 'learning_rate': 0.01,
 'gamma': 0}
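Note on the next step: the notebook re-instantiates the classifier from `best_params_`, but with `refit=True` (the RandomizedSearchCV default) the tuned model is also available directly as `best_estimator_`, already retrained on the full training data. A minimal, self-contained illustration on a toy dataset with a hypothetical one-parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data and a tiny hypothetical grid, just to show the refit behaviour
X, y = make_classification(n_samples=200, random_state=1)
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": [1, 2, 3]},
    n_iter=3, cv=3, scoring="recall", random_state=1,
)
search.fit(X, y)

# best_estimator_ already carries the winning hyperparameters,
# so re-instantiating from best_params_ is optional
assert search.best_estimator_.get_params()["max_depth"] == search.best_params_["max_depth"]
```

Re-instantiating from `best_params_`, as done below, is still a reasonable choice when you want the final model object defined explicitly in the notebook.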
In [99]:
# Initiate XGBoost with the best parameters of RandomizedSearchCV
XGBoost_ROS_tuned = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    subsample=1,
    scale_pos_weight=12,
    reg_lambda=10,
    n_estimators=150,
    max_depth=1,
    learning_rate=0.01,
    gamma=0,
)

# Fitting
XGBoost_ROS_tuned.fit(X_train_ROS, y_train_ROS)
Out[99]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=0, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=1,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=150, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=1, ...)
In [100]:
# checking the model performance for Validation set
XGBoost_ROS_model_val_perf = model_performance_classification_sklearn(
    XGBoost_ROS_tuned, X_val, y_val
)
print("XGBoost_ROS Validation performance:")
XGBoost_ROS_model_val_perf
XGBoost_ROS Validation performance:
Out[100]:
Accuracy Recall Precision F1
0 0.161 1.000 0.161 0.277

Tuning RandomForest with SMOTE¶

In [101]:
######
## Tune RandomForest with SMOTE using RandomizedSearchCV
######

# Choose the type of classifier. 
rf2 = RandomForestClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {"n_estimators": [150,200,250],
    "min_samples_leaf": np.arange(5, 10),
    "max_features": np.arange(0.2, 0.7, 0.1), 
    "max_samples": np.arange(0.3, 0.7, 0.1),
    "max_depth":np.arange(3,4,5),
    "class_weight" : ['balanced', 'balanced_subsample'],
    "min_impurity_decrease":[0.001, 0.002, 0.003]
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the random search
rf_SMOTE_tuned = RandomizedSearchCV(rf2, parameters,n_iter=30, scoring=scorer,cv=5, random_state = 1, n_jobs = -1)

X_train_SMOTE, y_train_SMOTE = Balance_Data(1) #Generate Oversampled data using SMOTE - option 1

rf_SMOTE_tuned.fit(X_train_SMOTE, y_train_SMOTE)
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099 

After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,) 

Out[101]:
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=1),
                   n_iter=30, n_jobs=-1,
                   param_distributions={'class_weight': ['balanced',
                                                         'balanced_subsample'],
                                        'max_depth': array([3]),
                                        'max_features': array([0.2, 0.3, 0.4, 0.5, 0.6]),
                                        'max_samples': array([0.3, 0.4, 0.5, 0.6]),
                                        'min_impurity_decrease': [0.001, 0.002,
                                                                  0.003],
                                        'min_samples_leaf': array([5, 6, 7, 8, 9]),
                                        'n_estimators': [150, 200, 250]},
                   random_state=1, scoring=make_scorer(recall_score))
In [102]:
# checking the model performance for Validation set
rf_SMOTE_model_val_perf = model_performance_classification_sklearn(
    rf_SMOTE_tuned, X_val, y_val
)
print("rf_SMOTE_tuned Validation performance:")
rf_SMOTE_model_val_perf
rf_SMOTE_tuned Validation performance:
Out[102]:
Accuracy Recall Precision F1
0 0.870 0.902 0.560 0.691

Tuning XGBoost with SMOTE¶

In [103]:
######
## Tune XGBoost with SMOTE using RandomizedSearchCV
######

# defining model
model = XGBClassifier(random_state=1,eval_metric='logloss')

# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,200,50),
            'scale_pos_weight':[1,2,5,10,12,15],
            'learning_rate':[0.01,0.1,0.2,0.05],
            'gamma':[0,1,3,5],
            'subsample':[0.7,0.8,0.9,1],
            'max_depth':np.arange(1,7,1),
            'reg_lambda':[5,10,12,15]}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
XGBoost_SMOTE_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Generate Oversampled data using SMOTE - option 1
X_train_SMOTE, y_train_SMOTE = Balance_Data(1)

#Fitting parameters in RandomizedSearchCV
XGBoost_SMOTE_tuned.fit(X_train_SMOTE, y_train_SMOTE)

# Access the best estimator from the RandomizedSearchCV
XGBoost_SMOTE_tuned.best_params_
Before Class Balancing, counts of label 'Attrited': 976
Before Class Balancing, counts of label 'Not Attrited': 5099 

After Class Balancing, counts of label 'Yes': 5099
After Class Balancing, counts of label 'No': 5099 

After Class Balancing, the shape of train_X: (10198, 29)
After Class Balancing, the shape of train_y: (10198,) 

Out[103]:
{'subsample': 1,
 'scale_pos_weight': 15,
 'reg_lambda': 10,
 'n_estimators': 100,
 'max_depth': 2,
 'learning_rate': 0.01,
 'gamma': 0}
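Re-instantiating the model from `best_params_` (as done in the next cell by hand) can also be written generically by unpacking the dict into a fresh estimator. A self-contained sketch on synthetic data, using a DecisionTreeClassifier purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, just to demonstrate the pattern.
X, y = make_classification(n_samples=200, random_state=1)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]},
    n_iter=4, cv=3, random_state=1,
)
search.fit(X, y)

# Unpack the winning parameters into a fresh estimator and refit.
final_model = DecisionTreeClassifier(random_state=1, **search.best_params_)
final_model.fit(X, y)
```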
In [104]:
# Initiate XGBoost with the best parameters of RandomizedSearchCV
XGBoost_SMOTE_tuned = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    subsample=1,
    scale_pos_weight=15,
    reg_lambda=10,
    n_estimators=100,
    max_depth=2,
    learning_rate=0.01,
    gamma=0,
)

# Fitting
XGBoost_SMOTE_tuned.fit(X_train_SMOTE, y_train_SMOTE)
Out[104]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=0, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=2,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=1, ...)
In [105]:
# checking the model performance for Validation set
XGBoost_SMOTE_model_val_perf = model_performance_classification_sklearn(
    XGBoost_SMOTE_tuned, X_val, y_val
)
print("XGBoost_SMOTE Validation performance:")
XGBoost_SMOTE_model_val_perf
XGBoost_SMOTE Validation performance:
Out[105]:
Accuracy Recall Precision F1
0 0.535 1.000 0.257 0.409

Model Comparison and Final Selection¶

In [106]:
#######
# Add the validation-set performance scores for the Tuned Models alongside the
# Confidence Intervals so that the results are easy to compare.
#######

tuned_val_perf = []  # create an empty list

# Append the Recall scores for the Tuned Models on validation data
tuned_val_perf.append(bag_ROS_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(rf_ROS_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(gb_ROS_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(XGBoost_ROS_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(rf_SMOTE_model_val_perf.at[0, "Recall"])
tuned_val_perf.append(XGBoost_SMOTE_model_val_perf.at[0, "Recall"])

# Add the list to the CI dataframe
confidence_intervals_df["Tuned_Val_Perf"] = tuned_val_perf

confidence_intervals_df
Out[106]:
Lower CI Upper CI Val_Perf Tuned_Val_Perf
Bagging_ROS 0.992 1.000 0.813 0.945
RandomForest_ROS 0.996 0.999 0.844 0.911
GradientBoost_ROS 0.973 0.980 0.939 0.899
XGBoost_ROS 0.998 1.000 0.936 1.000
RandomForest_SMOTE 0.965 0.975 0.856 0.902
XGBoost_SMOTE 0.974 0.983 0.893 1.000

Observations

  • XGBoost_SMOTE_tuned and XGBoost_ROS_tuned both achieved a Recall score of 1.0 (100%) on the Validation set.

  • The BEST model is XGBoost_ROS_tuned: its Validation Recall falls within its 95% Confidence Interval (0.998 - 1.000).

  • Our 2nd best model is XGBoost_SMOTE_tuned: its Validation Recall of 1.0 lies above the upper bound of its 95% Confidence Interval (0.974 - 0.983).

  • Bagging_ROS_tuned is our 3rd best model, with a Validation Recall of 0.945 (94.5%), which is below its 95% Confidence Interval.

  • All our tuned models give a Recall score on the Validation set that is > 90%. Note, however, that the perfect Recall of the two XGBoost models comes with very low Precision (0.161 and 0.257 respectively), so this trade-off should be weighed against business costs.

How The BEST Model Performs on Unseen Test Data¶

In [107]:
# Checking the model performance for Unseen Test set
XGBoost_ROS_model_test_perf = model_performance_classification_sklearn(
    XGBoost_ROS_tuned, X_test, y_test
)
print("XGBoost_ROS_tuned TEST performance:")
XGBoost_ROS_model_test_perf
XGBoost_ROS_tuned TEST performance:
Out[107]:
Accuracy Recall Precision F1
0 0.160 1.000 0.160 0.276
In [108]:
# Confusion matrix
confusion_matrix_sklearn(XGBoost_ROS_tuned, X_test, y_test)
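`confusion_matrix_sklearn` is a notebook helper defined earlier; the underlying counts presumably come from sklearn's `confusion_matrix`, whose layout on toy labels is:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: rows of the result are actual classes, columns are predicted classes.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[TN, FP], [FN, TP]] -> [[1, 1], [1, 2]]
```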

Observations

  • The BEST model is XGBoost_ROS_tuned, with a Recall score of 1.0 (100%) on both the Validation and the Test data; this Recall falls within its 95% Confidence Interval.

Feature Importances¶

In [109]:
#Plot the feature importances
feature_names = X_train.columns
importances = XGBoost_ROS_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations

  • Based on the feature importances, Total_Trans_Ct and Total_Revolving_Bal are the two most important features.
  • All other features appear to contribute very little.

SHAP Summary Plot¶

In [110]:
# Import shap - helps in visualizing the relationships in the model
import shap

# Initialize the package
shap.initjs()
In [111]:
# calculating SHAP values
explainer = shap.TreeExplainer(XGBoost_ROS_tuned)
shap_values = explainer.shap_values(X_train)
In [112]:
# Make plot.
shap.summary_plot(shap_values, X_train)

Observations

  • Total_Trans_Ct and Total_Revolving_Bal are the top (and only) two features that contribute to the prediction of the target.

  • All the other features have SHAP values of 0 (SHAP = 0.0), meaning they have no impact on the model's predictions.

Pipeline: Productionize the BEST Model¶

In [113]:
######
# User defined function to be used by FunctionTransformer in creating Pipeline.
######


def myProcessingSteps(df):
    # Note: this function mutates the incoming DataFrame in place, relies on the
    # global `num_features` list, and re-fits a MinMaxScaler on whatever data it
    # receives (so at scoring time the scaling is based on the incoming batch
    # rather than on the training range).

    # -----
    # Drop CLIENTNUM
    # -----
    df.drop(["CLIENTNUM"], axis=1, inplace=True)

    # -----
    # Replace the corrupt entries containing 'abc' with NaN
    # -----
    df.replace("abc", np.nan, inplace=True)

    # -----
    # Log-transform the numerical columns (log(x + 1) handles zeros)
    # -----
    for col in num_features:
        df[col] = np.log(df[col] + 1)

    # -----
    # Min-max scaling of numeric features
    # -----
    for col in num_features:
        df[col] = MinMaxScaler().fit_transform(df[[col]])

    return df

FunctionTransformer¶

Convert myProcessingSteps() into a transformer object that can be integrated into the Pipeline.

In [114]:
func_transform = FunctionTransformer(myProcessingSteps)
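FunctionTransformer simply wraps a plain Python function so it can sit inside a Pipeline. A self-contained sanity check with a simplified stand-in for myProcessingSteps (the column name and values are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

def toy_steps(df):
    # Simplified stand-in for myProcessingSteps: log-transform, then min-max scale.
    df = df.copy()
    df["Credit_Limit"] = np.log(df["Credit_Limit"] + 1)
    df["Credit_Limit"] = MinMaxScaler().fit_transform(df[["Credit_Limit"]])
    return df

ft = FunctionTransformer(toy_steps)
out = ft.transform(pd.DataFrame({"Credit_Limit": [1000.0, 5000.0, 20000.0]}))
print(out["Credit_Limit"].tolist())  # values scaled into [0, 1]
```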

ColumnTransformer¶

Apply different preprocessing steps to different subsets of columns.

In [115]:
# Creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
    ]
)

# Creating a transformer for categorical variables, which will first apply simple imputer and
# then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Combining categorical transformer and numerical transformer using a column transformer
col_transform = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ],
    remainder="passthrough",
)
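On a tiny frame with missing values, the combined transformer imputes and one-hot encodes as expected. A sketch with illustrative column names:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one numeric and one categorical column, each with a missing value.
toy = pd.DataFrame({"age": [25.0, np.nan, 40.0], "gender": ["M", "F", np.nan]})

num = Pipeline([("imputer", SimpleImputer(strategy="median"))])
cat = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

ct = ColumnTransformer([("num", num, ["age"]), ("cat", cat, ["gender"])])
out = ct.fit_transform(toy)
print(out.shape)  # (3, 3): imputed age column + two one-hot gender columns
```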

Build Pipeline¶

In [116]:
# Check the original dataset
df.head()
Out[116]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
In [117]:
# Dividing dataset into X and y

X = df.drop(["Attrition_Flag"], axis=1)
y = df["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
In [118]:
# Splitting data into training and test set:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
(7088, 20) (3039, 20)
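The `stratify=y` argument keeps the attrition rate (roughly 16% in this dataset) nearly identical in both splits. A quick check on synthetic labels with a similar positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a ~16% positive rate, similar to the churn rate.
y = np.array([0] * 84 + [1] * 16)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both close to 0.16
```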
In [119]:
churn_pipe = make_imb_pipeline(
    func_transform,
    col_transform,
    RandomOverSampler(random_state=1),
    XGBClassifier(
        random_state=1,
        eval_metric="logloss",
        subsample=1,
        scale_pos_weight=12,
        reg_lambda=10,
        n_estimators=150,
        max_depth=1,
        learning_rate=0.01,
        gamma=0,
    ),
)
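The RandomOverSampler step inside the pipeline duplicates minority-class rows (sampled with replacement) until the classes are balanced. The same idea in plain numpy, on toy data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Imbalanced toy data: 16 majority rows, 4 minority rows.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)

majority = np.where(y == 0)[0]
minority = np.where(y == 1)[0]

# Sample extra minority indices with replacement until both classes match.
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])

X_res, y_res = X[idx], y[idx]
print((y_res == 0).sum(), (y_res == 1).sum())  # 16 16
```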
In [120]:
# Fit the model on training data
churn_pipe.fit(X_train, y_train)
Out[120]:
Pipeline(steps=[('functiontransformer',
                 FunctionTransformer(func=<function myProcessingSteps at 0x7f8f114f9790>)),
                ('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['Customer_Age',
                                                   'Months_on_book',
                                                   'Total_Relationship_Count',
                                                   'Months_Inactive_12_mon',
                                                   'Contact...
                               feature_types=None, gamma=0, gpu_id=None,
                               grow_policy=None, importance_type=None,
                               interaction_constraints=None, learning_rate=0.01,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=1, max_leaves=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, n_estimators=150,
                               n_jobs=None, num_parallel_tree=None,
                               predictor=None, random_state=1, ...))])
In [121]:
# Calculating different metrics on test set
XGBoost_ROS_tuned_pipeline_perf = model_performance_classification_sklearn(
    churn_pipe, X_test, y_test
)
print("Test performance:")
XGBoost_ROS_tuned_pipeline_perf
Test performance:
Out[121]:
Accuracy Recall Precision F1
0 0.161 1.000 0.161 0.277

Business Insights and Recommendations¶

Key Insights from EDA¶

From the Exploratory Data Analysis done, the following observations and key insights became evident:

  1. The perfect positive linear relationship between Avg_Open_To_Buy and Credit_Limit suggests that many customers are not using their cards. Further analysis showed that:

    ◎ Only a small fraction of customers (0.3%) are active every month.

    ◎ About 45.3% of customers have not used their cards for 3 months or more in the last 12 months.


  2. The more products a Customer subscribes to, the less likely it is for the Customer to Attrite.

  3. Interestingly, neither the length of the relationship with the bank (Months_on_book) nor the customer's age (Customer_Age) appears to affect Attrition.

  4. Attrited Customers generally have a low Total_Revolving_Bal (the balance that carries over from one month to the next), low Avg_Utilization_Ratio (how much of the available credit the customer spent) & low Total_Trans_Ct (total transaction count (Last 12 months)).

Key Insights from the BEST Model¶

From our BEST of 30 models, Tuned XGBoost on RandomOverSampled dataset, the following observations and key insights became evident:

  1. From a business perspective, Recall is the best measure for the desired predictive model. The Tuned XGBoost on the RandomOverSampled dataset gives the best Recall score of 1.0 (100%), with low bias and low variance.

  2. This Recall score of 1.0 falls within the model's 95% Confidence Interval.

    • This means that when the model is deployed, we are confident it will continue to flag virtually every customer who is about to attrite. (Note that its Precision is low, so many flagged customers will not actually churn.)
  3. From the Feature Importances of our BEST model, the 2 most important features are Total_Trans_Ct and Total_Revolving_Bal.

Recommendations to the Business¶

Based on the predictive models, analysis, insights, and observations, here are specific business recommendations:

  1. *The Thera Bank should curate targeted campaigns to increase the following key metrics:*

    ◎ The Total Transaction Count of customers.

    ◎ The Total Revolving Balance: the balance that carries over from one month to the next.

    ◎ The Utilization_Ratio: how much of the available credit the customer spent.


  2. More Products should be offered to customers (particularly customers utilizing few of the Bank's products). This is because the more products a Customer subscribes to, the less likely it is for the Customer to Attrite.

  3. The business should create a LOYALTY Program that rewards customers for using their credit cards. This will drive down Avg_Open_To_Buy (the amount of credit left unused) relative to the Credit_Limit.

  4. Specific promotional and rewards programs can be created for different income groups.

  5. Customers with income of more than $120K should also be focused on, to enhance their satisfaction and reduce attrition.

Further Analysis and Modeling:

  1. It is recommended that further data collection be done to capture additional information, such as customer satisfaction, and that follow-up analysis be done to pinpoint other specific reasons for Attrition.

  2. The data collection process can be enhanced to capture additional information on Customer Employment, Living Status, and Spending Patterns.

    ◎ Further analysis & modelling can then be done to unearth invaluable information.